141 research outputs found
Low-power Programmable Processor for Fast Fourier Transform Based on Transport Triggered Architecture
This paper describes a low-power processor tailored for fast Fourier
transform computations that exploits the transport-triggered architecture
template. The processor is software-programmable while retaining energy
efficiency comparable to existing fixed-function implementations. The power savings are
achieved by compressing the computation kernel into one instruction word. The
word is stored in an instruction loop buffer, which is more power-efficient
than regular instruction memory storage. The processor supports all
power-of-two FFT sizes from 64 to 16384 and, given 1 mJ of energy, can
compute 20916 transforms of size 1024.
Comment: 5 pages, 4 figures, 1 table, ICASSP 2019 conference
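The energy figure quoted in the abstract can be turned into a per-transform cost with one division. The sketch below only restates the abstract's two numbers (1 mJ and 20916 transforms); the function name is hypothetical and the result is an average, not a measured value.

```python
# Average energy per 1024-point FFT implied by the abstract's figures.
ENERGY_BUDGET_J = 1e-3   # 1 mJ, as stated in the abstract
TRANSFORMS = 20916       # 1024-point transforms computed with that budget


def energy_per_transform_nj(budget_j: float, transforms: int) -> float:
    """Return the average energy per transform in nanojoules."""
    return budget_j / transforms * 1e9


if __name__ == "__main__":
    nj = energy_per_transform_nj(ENERGY_BUDGET_J, TRANSFORMS)
    print(f"{nj:.1f} nJ per 1024-point FFT")  # prints "47.8 nJ per 1024-point FFT"
```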
pocl: A Performance-Portable OpenCL Implementation
OpenCL is a standard for parallel programming of heterogeneous systems. The
benefits of a common programming standard are clear; multiple vendors can
provide support for application descriptions written according to the standard,
thus reducing the program porting effort. While the standard brings the obvious
benefits of platform portability, the performance portability aspects are
largely left to the programmer. The situation is made worse by multiple
proprietary vendor implementations with different characteristics and, thus,
different required optimization strategies.
In this paper, we propose an OpenCL implementation that is both portable and
performance portable. At its core is a kernel compiler that can be used to
exploit the data parallelism of OpenCL programs on multiple platforms with
different parallel hardware styles. The kernel compiler is modularized to
perform target-independent parallel region formation separately from the
target-specific parallel mapping of the regions to enable support for various
styles of fine-grained parallel resources such as subword SIMD extensions, SIMD
datapaths and static multi-issue. Unlike previous similar techniques that work
on the source level, the parallel region formation retains the information of
the data parallelism using the LLVM IR and its metadata infrastructure. This
data can be exploited by the later generic compiler passes for efficient
parallelization.
The proposed open source implementation of OpenCL is also platform portable,
enabling OpenCL on a wide range of architectures, both those already commercialized
and those still under research. The paper describes how the
portability of the implementation is achieved. Our results show that most of
the benchmarked applications, when compiled using pocl, were faster than or close to
as fast as the best proprietary OpenCL implementation for the platform at hand.
Comment: This article was published in 2015; it is now openly accessible via
arXiv
Memory-Based FFT Architecture with Optimized Number of Multiplexers and Memory Usage
This brief presents a new P-parallel radix-2 memory-based fast Fourier transform (FFT) architecture. The aim of this work is to reduce the number of multiplexers and achieve efficient memory usage. One advantage of the proposed architecture is that it only needs permutation circuits after the memories, which reduces the multiplexer usage to one multiplexer per parallel branch. Another advantage is that the architecture calculates the same permutation, based on the perfect shuffle, at each iteration. Thus, the shuffling circuits do not need to be reconfigured for different iterations. In fact, all the memories require the same read and write addresses, which simplifies the control even further and allows the memories to be merged. Along with the hardware efficiency, conflict-free memory access is achieved with a circular counter. The FFT has been implemented on a field-programmable gate array. Compared to previous approaches, the proposed architecture has the fewest multiplexers and achieves very low area usage.
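The perfect shuffle mentioned in the abstract is the permutation that interleaves the two halves of a sequence, which is the same as rotating each element's index left by one bit. The sketch below illustrates only that permutation in software; it is not the paper's hardware architecture, and the function names are hypothetical.

```python
def perfect_shuffle(data):
    """Interleave the first and second halves of an even-length sequence."""
    half = len(data) // 2
    out = []
    for a, b in zip(data[:half], data[half:]):
        out.extend((a, b))
    return out


def rotate_left(index, bits):
    """Rotate a `bits`-wide index left by one bit position."""
    mask = (1 << bits) - 1
    return ((index << 1) & mask) | (index >> (bits - 1))


if __name__ == "__main__":
    bits = 3
    data = list(range(1 << bits))
    shuffled = perfect_shuffle(data)
    print(shuffled)  # prints [0, 4, 1, 5, 2, 6, 3, 7]
    # The element at source position s lands at position rotate_left(s, bits).
    assert all(shuffled[rotate_left(s, bits)] == data[s] for s in data)
```

Because the shuffle is an index-bit rotation, applying it log2(N) times restores the original order, which is one way to see why a single fixed permutation can serve every iteration.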
Efficient Software Synthesis of Dynamic Dataflow Programs
This paper introduces advanced software synthesis techniques that enhance the implementation of dynamic dataflow programs. These techniques have been implemented in open-source tools and demonstrated on well-known video decoders, including one based on the new High Efficiency Video Coding (HEVC) standard. The results show a frame-rate improvement of more than 100% over previously proposed implementations and achieve real-time decoding of high-definition video sequences.
OpenCL-based design methodology for application-specific processors
OpenCL is a programming language standard which enables the programmer to express the application by structuring its computation as kernels. The OpenCL compiler is given the explicit freedom to parallelize the execution of kernel instances at all levels of parallelism. In comparison to the traditional C programming language, which is sequential in nature, OpenCL enables higher utilization of the parallelism naturally available in hardware constructs while still having a feasible learning curve for engineers familiar with the C language. This paper describes the methodology and compiler techniques involved in applying OpenCL as an input language for a design flow of application-specific processors. At the core of the methodology is a whole-program optimizing compiler that links together the host and kernel codes of the input OpenCL program and parallelizes the result on a customized statically scheduled processor. The OpenCL vendor extension mechanism is used to provide clean access to custom operations. The methodology is studied with a design case to verify the scalability of the implementation at the instruction level and to exemplify the use of custom operations. The case shows that the use of OpenCL allows producing scalable application-specific processor designs and makes it possible to gradually reach the performance of hand-tailored RTL designs by exploiting the OpenCL extension mechanism to access custom hardware operations of varying complexity.
Sonification of Markov-chain Monte Carlo Simulations
Hermann T, Hansen MH, Ritter H. Sonification of Markov-chain Monte Carlo Simulations. In: Hiipakka J, Zacharov N, Takala T, eds. Proceedings of 7th International Conference on Auditory Display. Helsinki University of Technology: Laboratory of Acoustics and Audio Signal Processing and the Telecommunications Software and Multimedia Laboratory; 2001: 208-216.
Markov chain Monte Carlo (McMC) simulation is a popular computational tool for making inferences from complex, high-dimensional probability densities. Given a particular target density, the idea behind this technique is to simulate a Markov chain that has this density as its stationary distribution. To be successful, the chain needs to be run long enough so that the distribution of the current draw is close to the target density. Unfortunately, very few diagnostic tools exist to monitor characteristics of the chain. In this paper, we present a new approach to render sonifications of McMC simulations. The proposed method consists of several auditory streams which provide information about the behavior of the Markov chain. In particular, we focus on uncovering modes in the target density function. In addition to monitoring, we have found our sonification to be an effective means for understanding the structure of high-dimensional densities. We have also applied our method to the exploratory analysis of high-dimensional data sets. In this case, we take as our target a non-parametric density estimate obtained from the data. In this paper, we present a detailed description of our sonification design and illustrate its performance on test cases consisting of both synthetic and real-world data sets. Sound examples are also given.
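The idea of simulating a Markov chain whose stationary distribution is the target density can be sketched with a random-walk Metropolis sampler. This is a generic textbook sampler, not the authors' sonification method; the bimodal target, the function names, and the parameter defaults are all assumptions chosen for illustration.

```python
import math
import random


def bimodal_density(x):
    """Unnormalized mixture of two unit Gaussians (hypothetical test target)."""
    return math.exp(-0.5 * (x + 3.0) ** 2) + math.exp(-0.5 * (x - 3.0) ** 2)


def metropolis(density, start, steps, step_size=1.0, seed=0):
    """Random-walk Metropolis: returns the chain of visited states.

    `start` must lie where `density` is positive.
    """
    rng = random.Random(seed)
    x = start
    chain = []
    for _ in range(steps):
        proposal = x + rng.gauss(0.0, step_size)
        # Accept with probability min(1, density(proposal) / density(x)),
        # written without a division.
        if rng.random() * density(x) < density(proposal):
            x = proposal
        chain.append(x)
    return chain


if __name__ == "__main__":
    chain = metropolis(bimodal_density, start=0.0, steps=5000, step_size=2.0)
    right = sum(1 for x in chain if x > 0)
    print(f"fraction of draws in the right-hand mode: {right / len(chain):.2f}")
```

How often the chain crosses between the two modes depends strongly on `step_size`, which is the kind of behavior a diagnostic would need to expose.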
New Identical Radix-2^k Fast Fourier Transform Algorithms
The radix-2^k fast Fourier transform (FFT) algorithm is used to achieve both a radix-2 butterfly and a reduced number of twiddle factor multiplications. In this paper, we present new identical radix-2^k FFT algorithms, which have the same number of butterflies and twiddle factor multiplications. The difference is only in the location of the twiddle factor stages in the signal flow graph (SFG). Further, we analyze these algorithms and show that the output round-off noise of the identical radix-2^2, radix-2^3, and radix-2^4 FFT algorithms is reduced by 27%, 8%, and 3%, respectively.
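For readers unfamiliar with the butterfly and twiddle factor terminology, the following is a minimal sketch of a plain radix-2 decimation-in-time FFT. It is not one of the radix-2^k variants proposed in the paper; the function names are hypothetical, and the direct DFT is included only as a reference for checking the result.

```python
import cmath


def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT for power-of-two lengths."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor multiplication followed by a radix-2 butterfly.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def dft_direct(x):
    """O(n^2) reference DFT used only to check the FFT."""
    n = len(x)
    return [
        sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
        for k in range(n)
    ]


if __name__ == "__main__":
    signal = [1, 2, 3, 4, 0, 0, 0, 0]
    fast = fft_radix2(signal)
    slow = dft_direct(signal)
    print(max(abs(a - b) for a, b in zip(fast, slow)) < 1e-9)  # prints True
```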